Task Fusion #113
base: branch-24.03
Conversation
Object cycle fix
Fix for Issue 77
legate/core/runtime.py
Outdated
#self.validIDs.add(5) #Convert op
#self.validIDs.add(7) #dot op
#self.terminals.add(7) #dot op is only fusable as a terminal op in a window
I initially introduced the notion of a 'terminal' op that is fusable only at the end of a window. I don't use this so will remove it.
legate/core/runtime.py
Outdated
intervals.append((start, end))
#print("proj", intervals)
return True, intervals
As stated, the above constraint isn't particularly cheap, and it doesn't seem to ever be "used". I will look into the perf, but I suspect the get_projection method is the culprit here.
legate/core/runtime.py
Outdated
fusion_checker.register_constraint(IdenticalProjection())
fusion_checker.register_constraint(ValidProducerConsumer())
can_fuse, fusable_sets, partitions = fusion_checker.can_fuse()
fusion constraints are registered above
print("launch unfused input", op._task_id, input)
self.propogateFuture(op)
op.launch(op.strategy)
the above func will be cleaned up and simplified in the next day or so
if store._key_partition is not None:
    partition = store._key_partition
else:
    partition = store.compute_key_partition(restrictions)
If the store already has a key partition (set by the fuser), no need to compute it
There may be a more elegant way of doing this
You should check whether the key partition satisfies the restrictions before you reuse it, and I think this logic should be moved into compute_key_partition. Just so you know, I'm currently refactoring the code that manages Legate stores, so don't make micro-optimizations like this, as they will become obsolete in the near future.
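A minimal sketch of the suggested shape, folding the reuse into compute_key_partition; the satisfies() check and the _compute_key_partition fallback are hypothetical stand-ins, not the real Store API:

def compute_key_partition(store, restrictions):
    # Reuse the cached key partition (e.g. one set by the fuser) only if it
    # still satisfies the restrictions; otherwise recompute it.
    cached = store._key_partition
    if cached is not None and cached.satisfies(restrictions):  # assumed method
        return cached
    return store._compute_key_partition(restrictions)  # hypothetical fallback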
src/core/data/store.cc
Outdated
fprintf(stderr, "Found an uninitialized Legate store\n");
assert(false);
//fprintf(stderr, "Found an uninitialized Legate store\n");
//assert(false);
I will come back to this; it was done to make the constant optimization work. I will check whether I indeed need this assertion removed or modified.
@@ -794,8 +794,8 @@ def driver():
)
all changes to this file will be reverted before merge
name="legate.core", | ||
version="0.1", | ||
name="legate-core", | ||
version="21.10.00", |
Is the current version indeed 0.1? If so I'll undo the above change
def build_task(self, launch_domain, argbuf):
    self._req_analyzer.analyze_requirements()
    self._out_analyzer.analyze_requirements()

    # pack fusion metadata
    self.pack_fusion_metadata(argbuf, self._is_fused, self._fusion_metadata)
I'm not a fan of this design. The fusion task shouldn't be any different from a normal Legate task. I believe the fusion metadata can be passed as a bunch of scalar arguments to the fusion task. If we do that, then we don't need to make every place in the core handle fusion. Making the default path aware of task fusion doesn't sound like a good design.
This is definitely worth discussing (I pondered this myself). I did think about making them scalars, but was not sure whether there were other implications/downsides of storing all the metadata that way (e.g. is there a maximum number of scalars we can give a task?), since we'd be sending quite a few scalars to the task (or rather multiple lists, each of which can have >50 scalars; the number of scalars in 4 of the 5 lists is equivalent to the fused_op length, even with potential deduplication of stores).
If we did this, we'd be mixing "metadata" scalars (which are really a bunch of lists) into the task's scalars array alongside the actual scalars used by the sub-tasks, which seemed not to really adhere to the abstraction of having a dedicated scalars array in the context data structure. I thus chose to serialize it as Legate task argument data. As you stated, the downside is that the core/default path has to be "fusion aware".
But if you don't see any potential downsides of making them all scalars, then yeah, it could totally be implemented as scalar-based. We would mostly just have to move the book-keeping logic from the deserializer into the Fused Legate Task.
I feel what's actually missing here is the ability to pass an arbitrary chunk of data (probably a memoryview object in Python) to a task, which is in theory possible, so we should probably extend the buffer builder interface to take an arbitrary memory view object and pack it as a single contiguous chunk. The assumption here is that the task knows how to interpret those bytes. It shouldn't be too difficult for me to add it.
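A minimal sketch of what that could look like on the Python side, assuming a flat layout of little-endian 32-bit ints and a hypothetical pack_buffer hook on the buffer builder (neither is an existing Legate API):

import struct

def pack_fusion_metadata(input_starts, output_starts, op_ids):
    # Layout: [num_ops][input_starts...][output_starts...][op_ids...], all
    # little-endian 32-bit ints; the fused task must unpack with the same
    # layout, e.g. struct.unpack_from("<i", buf, 0) for the op count.
    ints = [len(op_ids), *input_starts, *output_starts, *op_ids]
    return memoryview(struct.pack("<%di" % len(ints), *ints))

# e.g. argbuf.pack_buffer(pack_fusion_metadata(...))  # hypothetical method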
TaskContext(const Legion::Task* task,
            const std::vector<Legion::PhysicalRegion>& regions,
            Legion::Context context,
            Legion::Runtime* runtime);

TaskContext(const Legion::Task* task, const std::vector<Legion::PhysicalRegion> regions)
Why is this constructor necessary?
Each sub-op is called like this (via function pointer):

for (int i = 0; i < fused_op_len; i++) {
  // look up sub-op i's registered GPU variant and invoke it on the context
  auto descriptor = Core::gpuDescriptors.find(opIDs[i])->second;
  descriptor(context);
}
However, the context above is not the same as the parent task's context. Whereas the parent task's context contains the "union" of all inputs, outputs, reductions, and scalars over all sub-ops, the above context should only include the inputs, outputs, etc. for sub-operation/task i. I used this constructor to build the operation-specific contexts, which hold each sub-op's specific set of inputs, outputs, etc.
This brings up a potential issue: the new context's task and regions fields are currently not specific to the sub-op. I'm not sure how much of an issue that is, as I only need the context structure to hold stores and scalars for the sub-ops. Ideally I wanted to avoid any performance cost of creating new Task objects in the Fused Task, so I elected to keep the new context object as such.
You do need a new context in this case, but my question is more about why you couldn't use the existing constructor. The constructor you added leaves the runtime and context members uninitialized, leading to a crash if the child task tries to access the runtime (though no task in cuNumeric is actually doing that). I think the context you are creating here should still have a valid runtime and context pointer.
That said, I think you actually need a slightly different change here, one which allows TaskContext to override the Legion task's id. The task object here is that of the fusion task, so if you pass it as-is to a child task, the child task may be misled about its own identity. I believe no cuNumeric task is actually doing this, but leaving it unaddressed can be a foot gun later on.
Yeah, it's a valid point. I initially tried using the normal constructor and got runtime errors from the task deserializer. Recreating the issue, I get the following (the error is from span.h: Assertion pos < size_ failed.):
legion_python: /home/shiv1/fmultiGPU/legate.core/src/core/utilities/span.h:38: decltype(auto) legate::Span<T>::operator[](size_t) [with T = const Legion::PhysicalRegion; size_t = long unsigned int]: Assertion `pos < size_' failed.
Signal 6 received by node 0, process 3131805 (thread 7f8d47bf3000) - obtaining backtrace
Signal 6 received by process 3131805 (thread 7f8d47bf3000) at: stack trace: 19 frames
[0] = /lib/x86_64-linux-gnu/libpthread.so.0(+0x153c0) [0x7f8d70ce33c0]
[1] = /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xcb) [0x7f8d707f718b]
[2] = /lib/x86_64-linux-gnu/libc.so.6(abort+0x12b) [0x7f8d707d6859]
[3] = /lib/x86_64-linux-gnu/libc.so.6(+0x25729) [0x7f8d707d6729]
[4] = /lib/x86_64-linux-gnu/libc.so.6(+0x36f36) [0x7f8d707e7f36]
[5] = /home/shiv1/fmultiGPU/legate.core/install/lib/liblgcore.so(+0x5a95d) [0x7f8d4428595d]
[6] = /home/shiv1/fmultiGPU/legate.core/install/lib/liblgcore.so(legate::TaskDeserializer::_unpack(legate::Store&)+0x10e) [0x7f8d44286d9e]
[7] = /home/shiv1/fmultiGPU/legate.core/install/lib/liblgcore.so(legate::TaskContext::TaskContext(Legion::Task const*, std::vector<Legion::PhysicalRegion, std::allocator<Legion::PhysicalRegion> > const&, Legion::Internal::TaskContext*, Legion::Runtime*)+0x7f6) [0x7f8d44278626]
[8] = /home/shiv1/fmultiGPU/legate.core/install/lib/libcunumeric.so(cunumeric::FusedOpTask::gpu_variant(legate::TaskContext&)+0x6e2) [0x7f8d1ce0a852]
At the time I didn't spend much time debugging this and just wanted something that worked for the time being (the constructor I made didn't call the deserializer; I didn't want to reserialize/deserialize things anyway, because it involves additional overhead). I'm guessing it has to do with the number of futures or scalars passed in, but I can totally dive a little deeper into what's causing this.
I suspect the crash is due to the fusion metadata you added that the current serializer doesn't understand, so this issue is tied to the fusion awareness introduced in the code base, which I suggest removing. Actually, we shouldn't be deserializing the task argument for the child task context, so you're right that there should be a constructor that doesn't deserialize the argument. I guess we could add an argument to the existing constructor to control the behavior.
@@ -121,13 +130,14 @@ class TaskContext {
 public:
  ReturnValues pack_return_values() const;

 private:
 public:
Why did you want to expose the private members here?
Ahh, this should be reverted. It's an artifact from when we were using the inline launcher and I needed access to these members.
input_start, output_start, offset_start = 0, 0, 0
reduction_start, scalar_start, future_start = 0, 0, 0

for op in ops:
I guess we'd eventually want coalescing of arguments here as well. Right now, the fusion code passes the same store used by multiple tasks multiple times, hoping that the launcher at least coalesces region requirements for those duplicates. But, we can identify that the same store is being used in multiple tasks when we scan them for dataflow analysis. So, we could potentially pass a deduplicated list of stores to the fusion task with a mapping between the original indices and those in the store list that is passed to the fusion task.
Yeah, we could deduplicate these, and I don't imagine this being too difficult. We would just have to place the offset of the deduplicated store (eg in line 1145) in the offsets array, and it should just work. A sketch of the idea follows below.
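A minimal sketch of that deduplication, keying stores by identity (the names are illustrative, not the PR's):

def deduplicate_stores(all_store_uses):
    unique_stores = []  # deduplicated store list passed to the fused task
    index_of = {}       # id(store) -> position in unique_stores
    offsets = []        # per original argument slot: index into unique_stores
    for store in all_store_uses:
        key = id(store)
        if key not in index_of:
            index_of[key] = len(unique_stores)
            unique_stores.append(store)
        offsets.append(index_of[key])
    return unique_stores, offsets

# e.g. for uses [s1, s2, s1] this yields ([s1, s2], offsets [0, 1, 0])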
Thanks a lot for all your hard work on this PR. I've left several comments and questions, but you don't necessarily need to address all of them. I'll pick this up probably after the break and we can work together to make this ready for the merge.
ty.uint32,
)
ty.uint32,)
self._window_size = 50
this shouldn't be hardcoded, will remove
@shivsundram @magnatelee What is the status of this PR?
@marcinz I'm planning to port/rebase this once the planned refactoring on the core is done.
PR for Task Fusion
Task Fusion PR guide
This PR implements:
Task Fusion
i. Fusibility determination and creation of the fused ops, done in the Python code (see /core/runtime.py)
ii. Issuing of a fused task's sub-operations, currently done in cuNumeric C++ code. This can be viewed in the companion PR: Task Fusion, Constant Conversion Optimization, and 27pt stencil benchmark cupynumeric#150
iii. An interface for accessing each registered Legate task's function pointer, used for efficiently launching a fused task's sub-tasks
Constant Conversion Optimization
i. Eliminates issuing of expensive convert tasks for static/constant scalars (ie there's no need to defer execution of this type conversion when the underlying Future's value is a constant embedded in the code)
ii. This lives in cuNumeric, but the changes can be viewed here (will be contributed in a separate PR)
Benchmark changes
i. Re-organization of the synchronizations in black_scholes. Instead of checking that calls and puts != NaN at the end of every benchmarking run, we check this only after all benchmarking runs have concluded.
ii. A new 27-pt stencil benchmark, demonstrating how fusion boosts perf even though it theoretically decreases task-level parallelism.
iii. Lives in cuNumeric, but can be viewed in the companion PR here.
Required for merge (ie TODO):
- Test OpenMP support
Note
I am currently unable to move the Fused Task to the core, as doing this causes issues with the fused task's mapper tag. Until then, the Fused Task will live in cuNumeric, in the companion PR nv-legate/cupynumeric#150.
Overview of Optimizations
Task Fusion
Given a window of buffered Legate tasks, fusion finds sub-windows of operations that can each be aggregated into a single launch/fused operation. In other words, given a 100% fusable window of N tasks, the runtime would normally issue O(N) meta/utility tasks (mapping calls, scheduling, etc.) for this window, whereas the fused operation only requires O(1) such meta-tasks, thus reducing overhead. There are currently five constraints dictating op fusability (the fifth being temporary):
Fusability Constraints
1. Fusable Operations: Only a certain subset of Legate tasks is fusable (eg unary, binary, ternary ops).
2. Producer-Consumer: Within the same fused op, one cannot write to one store representing an underlying buffer/regionfield and then read from a different store representing the same underlying buffer/regionfield. This prevents the existence of invalid combinations of aliased partitions in a fused operation (eg the reads, adds, and mults in an in-place stencil are fusable, but not the write/update of the buffer at the end of the stencil iteration).
3. Launch Shape Equivalence: All sub-operations in a fused task must have the same launch shape.
4. Projection Functor Equivalence: If a store is used by multiple sub-ops in a fused task, every sub-op's functor for indexing into that store must be equivalent.
5. (Temporary) Existence of cuNumeric: Until the FusedTask is moved to the core, we can only fuse if the cuNumeric context exists (note that the Fused Task being in cuNumeric does not preclude it from fusing non-cuNumeric tasks).
All constraints live in core/runtime. Each constraint's apply function is used to apply the constraint to a window of ops; a sketch of this interface follows below. By Dec 8, this will be replaced with a new set of functions that apply constraints more efficiently (most of these are already in the PR and are currently called apply2, but will eventually just be called apply).
Currently, Constraint 4 is the most expensive to run, and it doesn't seem to ever trigger. If we can come up with a better way of testing it, runtime will further improve.
Constant Optimization
Currently implemented in cunumeric/array.py, when issuing binary_ops. Instead of creating a convert operation to convert a known scalar constant (eg from fp64 to fp32), we simply create the scalar in place and eliminate the need for said operation. Not only does this enable larger levels of fusion (as convert is not a fusable op), but the optimization in itself provides large performance boosts. Note this only triggers when the constant is a DeferredArray (not Eager, as this optimization actually slows down ops using EagerArrays).
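A hedged sketch of the shortcut (not cuNumeric's actual code): when an operand of a binary op is a plain Python constant, cast it eagerly with NumPy instead of issuing a deferred convert task.

import numpy as np

def coerce_constant(value, target_dtype):
    # eg a float64 literal feeding an fp32 binary op: cast it now, so no
    # (non-fusable) convert task is ever created for it
    return np.asarray(value).astype(target_dtype).item()

print(coerce_constant(2.0, np.float32))  # 2.0, already in the op's dtype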
Performance:
Depending on input sizes, GPU counts (1-16 currently tested), and GPU type (Pascal, Volta currently tested), we see:
~1.6-2x speedup for black scholes
~1.6-2x speedup for 27pt stencil
~1.15-2x speedup for logistic regression
~1.15x speedup for Jacobi iteration
No statistically significant speedup for conjugate gradient (between 1.0x-1.05x)
The maximum window_size in both the fused and unfused baseline (ie the nv-legate reference) is set to 50 in these tests. Speedup is even higher when the baseline uses a window_size of 1 (the current setting in the reference) while the fused code still uses n=50 (the current setting in this PR).
Testing
Tests are currently running (I have been testing as I go). I have not seen a single instance where the fused and unfused versions generate different outputs.
Known Issues: